Linear regression is a statistical method for fitting a line to data, used when the relationship between two variables, \(x\) and \(y\), can be modeled by a straight line with some error:
\[y = \beta_0 + \beta_1 x + \epsilon\]
\(\beta_0\) = y-intercept
\(\beta_1\) = slope
\(\epsilon\) = error term (a.k.a. residual)
Introduction to linear regression
When we use \(x\) to predict \(y\), we usually call:
\(x\) the explanatory, predictor or independent variable,
\(y\) the response or dependent variable
As we move forward in this chapter, we will learn about:
criteria for line-fitting
the uncertainty associated with estimates of model parameters
World’s most useless regression
Opossums…?
x <- read.csv("possum.csv")
head(x)
site pop sex age head_l skull_w total_l tail_l
1 1 Vic m 8 94.1 60.4 89.0 36.0
2 1 Vic f 6 92.5 57.6 91.5 36.5
3 1 Vic f 6 94.0 60.0 95.5 39.0
4 1 Vic f 6 93.2 57.1 92.0 38.0
5 1 Vic f 2 91.5 56.3 85.5 36.0
6 1 Vic f 1 93.1 54.8 90.5 35.5
World’s most useless regression
lm1 <- lm(head_l ~ total_l, x)
coefficients(lm1)
(Intercept) total_l
42.7097931 0.5729013
World’s most useless regression
summary(x$total_l)
Min. 1st Qu. Median Mean 3rd Qu. Max.
75.00 84.00 88.00 87.09 90.00 96.50
Residuals are the leftover variation after accounting for the model fit
The residual for the \(i^{th}\) observation (\(x_i, y_i\)) is the difference between the observed response (\(y_i\)) and the response predicted by the model fit (\(\hat{y_i}\)): \(e_i = y_i - \hat{y_i}\)
Normal residuals: Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points
Constant variability: The variability of points around the least squares line remains roughly constant
Independent observations: Be cautious about applying regression to time series data, which are sequential observations in time such as a stock price each day
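These conditions are usually checked visually with residual plots; a minimal sketch in R, using the built-in cars data as a stand-in for whatever model you have fit:

```r
# Fit an example model (substitute your own, e.g. the possum fit lm1)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Nearly normal residuals: points should track the reference line
qqnorm(res); qqline(res)

# Constant variability: residuals should scatter evenly around zero
plot(fitted(fit), res); abline(h = 0, lty = 2)
```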
You might recall the point-slope form of a line from math class; the least squares line passes through the point \((\bar{x}, \bar{y})\) with slope \(b_1\), which we can use to find the model fit, including the estimate of \(\beta_0\)
\[y - \bar{y} = b_1 (x - \bar{x})\]
To find the y-intercept, set \(x = 0\)
\[b_0 = \bar{y} - b_1 \bar{x}\]
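These formulas translate directly into code. A sketch on R's built-in cars data, using the standard identity \(b_1 = R \, s_y / s_x\) (correlation times the ratio of standard deviations) for the slope:

```r
x <- cars$speed; y <- cars$dist

# slope: b1 = R * (s_y / s_x)
b1 <- cor(x, y) * sd(y) / sd(x)
# intercept: b0 = ybar - b1 * xbar
b0 <- mean(y) - b1 * mean(x)

# agrees with lm() up to floating-point error
rbind(by_hand = c(b0, b1), lm_fit = coef(lm(y ~ x)))
```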
Least squares regression
Intercept
(mean_y - b1*mean_x)/1000
family_income
24.31979
coefficients(lm2)[1]; b0
(Intercept)
24.31933
(Intercept)
24.31933
Least squares regression
Interpretation: Slope
What do these regression coefficients mean?
The slope describes the estimated difference in the \(y\) variable if the explanatory variable \(x\) was one unit larger.
For each additional $1,000 of family income, we would expect a student to receive a net difference of $1,000×(−0.0431) = −$43.10 in aid on average, i.e. $43.10 less.
Least squares regression
Interpretation: Intercept
What do these regression coefficients mean?
The intercept describes the average outcome of \(y\) if \(x = 0\) and the linear model is valid all the way to \(x = 0\), which in many applications is not the case.
The estimated intercept, \(b_0 = 24.32\) (about $24,319, since aid is measured in $1000s), describes the average aid if a student's family had no income.
We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables.
Least squares regression
Extrapolation
Applying a model estimate to values outside of the realm of the original data is called extrapolation.
If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.
# predicted aid (in dollars) for a family income of $1,000,000
as.numeric(b0 + b1*1000)*1000
[1] -18752.32
Least squares regression
Strength of a fit: R-squared
We evaluated the strength of the linear relationship between two variables earlier using the correlation, \(R\).
However, it is more common to explain the strength of a linear fit using \(R^2\)
If provided with a linear model, we might like to describe how closely the data cluster around the linear fit.
About 25% of the variation in gift aid can be explained by differences in family income.
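For simple regression, \(R^2\) is literally the square of the correlation; a quick check in R on the built-in cars data:

```r
fit <- lm(dist ~ speed, data = cars)

r2_from_cor <- cor(cars$speed, cars$dist)^2   # square of the correlation R
r2_from_lm  <- summary(fit)$r.squared         # R-squared reported by lm()

c(r2_from_cor, r2_from_lm)                    # identical for a single predictor
```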
Least squares regression
Example 3: Categorical Predictors
x <- read.csv("mariokart.csv")
y <- data.frame(new   = ifelse(x$cond == "new", 1, 0),
                price = x$total_pr)
summary(y)
new price
Min. :0.0000 Min. : 28.98
1st Qu.:0.0000 1st Qu.: 41.17
Median :0.0000 Median : 46.50
Mean :0.4126 Mean : 49.88
3rd Qu.:1.0000 3rd Qu.: 53.99
Max. :1.0000 Max. :326.51
We might wonder, is this convincing evidence that the “true” linear model has a negative slope?
That is, do the data provide strong evidence that the political theory is accurate, i.e. that the unemployment rate is a useful predictor of midterm election results?
\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \neq 0\)
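The test of \(H_0: \beta_1 = 0\) reported by summary() is a t-test: the estimate divided by its standard error, compared against a t distribution. A sketch reproducing those columns by hand on R's built-in cars data:

```r
fit <- lm(dist ~ speed, data = cars)
ct  <- summary(fit)$coefficients

# t statistic: estimate / standard error
t_stat <- ct["speed", "Estimate"] / ct["speed", "Std. Error"]
# two-sided p-value from the t distribution with n - 2 df
p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)

c(t_stat, p_val)   # match the "t value" and "Pr(>|t|)" columns
```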
Inference for linear regression
Example: Midterm elections & unemployment
sd_x <- sd(x$unemp)
sd_y <- sd(x$house_change)
n <- nrow(x)
z <- list()
for(i in 1:10000){
  set.seed(i)
  y <- data.frame(x = rnorm(n, mean = 0, sd = sd_x),
                  y = rnorm(n, mean = 0, sd = sd_y))
  z[[length(z)+1]] <- coefficients(lm(y ~ x, y))[2]
}
z <- unlist(z)
dz <- density(z)
Call:
lm(formula = gift_aid ~ family_income, data = x)
Residuals:
Min 1Q Median 3Q Max
-10.1128 -3.6234 -0.2161 3.1587 11.5707
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.31933 1.29145 18.831 < 2e-16 ***
family_income -0.04307 0.01081 -3.985 0.000229 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.783 on 48 degrees of freedom
Multiple R-squared: 0.2486, Adjusted R-squared: 0.2329
F-statistic: 15.88 on 1 and 48 DF, p-value: 0.0002289
Introduction to Multiple Regression
Multiple regression extends simple two-variable regression to situations with one response variable and multiple predictors (denoted \(x_1\), \(x_2\), \(x_3\), …).
It is motivated by cases where several factors may simultaneously influence an outcome.
A multiple regression model is a linear model with multiple predictors. In general, we write the model as
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\]
where there are \(k\) predictors.
income_ver mean sd n min max
1 Not Verified 11.09946 4.569159 3594 11.02324 11.17567
2 Source Verified 12.51548 4.735268 4116 12.44167 12.58929
3 Verified 14.35375 5.447896 2290 14.23990 14.46759
Multiple Regression: Bankruptcy Example
Regression Analysis — Reference Group
This regression output provides multiple rows for the income verification variable.
Each row represents the relative difference for each level of income_ver.
However, one level is missing: Not Verified.
The missing level is called the reference group, representing the default category that all other levels are measured against.
Multiple Regression: Bankruptcy Example
Regression Interpretation
The higher interest rate for borrowers who have verified their income is surprising.
Intuitively, we might expect verified income to make a loan less risky.
However, the situation may be more complex and could involve confounding variables that we haven’t accounted for.
For example, perhaps lenders require borrowers with poor credit to verify their income. In that case, income verification could signal concern about repayment rather than reassurance. This would make the borrower appear higher risk, leading to a higher interest rate.
Multiple Regression: Bankruptcy Example
Omitted Variable Bias
Omitted variable bias occurs when a model leaves out relevant variables, leading to biased or misleading estimates of the included predictors.
The effects of the missing variables are incorrectly attributed to those remaining in the model, distorting interpretation.
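A small simulation (hypothetical numbers, for illustration) makes the bias concrete: when a relevant, correlated predictor is dropped, its effect leaks into the remaining coefficient:

```r
set.seed(1)
n  <- 5000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)          # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))["x1"]        # close to the true effect, 2
coef(lm(y ~ x1))["x1"]             # inflated: absorbs part of x2's
                                   # effect (about 2 + 3*0.8 = 4.4)
```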
Multiple regression: Bankruptcy example
Multivariate regression analysis
lm3 <- lm(interest_rate ~ income_ver + debt_to_income + credit_util +
          bankruptcy + term + issued + credit_checks, data = y)
summary(lm3)
Call:
lm(formula = interest_rate ~ income_ver + debt_to_income + credit_util +
bankruptcy + term + issued + credit_checks, data = y)
Residuals:
Min 1Q Median 3Q Max
-12.1219 -3.0984 -0.7247 2.3318 18.8160
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.893839 0.210245 9.008 < 2e-16 ***
income_verSource Verified 0.997468 0.099187 10.056 < 2e-16 ***
income_verVerified 2.563168 0.117184 21.873 < 2e-16 ***
debt_to_income 0.021832 0.002937 7.434 1.14e-13 ***
credit_util 4.896988 0.161887 30.249 < 2e-16 ***
bankruptcy 0.391105 0.132286 2.957 0.00312 **
term 0.153388 0.003944 38.889 < 2e-16 ***
issuedJan-2018 0.045510 0.108034 0.421 0.67358
issuedMar-2018 -0.041624 0.106545 -0.391 0.69605
credit_checks 0.228321 0.018242 12.516 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.3 on 9964 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.2604, Adjusted R-squared: 0.2597
F-statistic: 389.7 on 9 and 9964 DF, p-value: < 2.2e-16
Multiple regression: Bankruptcy example
Multivariate regression analysis – Interpretation
We estimate the parameters \(\beta_0, \beta_1, \beta_2, \ldots, \beta_9\) that minimize the sum of the squared residuals:
\[\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Each coefficient represents the incremental change in interest rate for that level, relative to the Not Verified group, which serves as the reference level.
For example, a borrower whose income source and amount have been verified is predicted to have a 2.56 percentage point higher interest rate than a borrower whose income has not been verified, holding the other predictors constant.
The estimated intercept is 1.89, and one might be tempted to interpret this as the model’s predicted interest rate when all predictors equal zero — i.e., income source not verified, no debt, zero credit utilization, and so on.
term (the length of the loan in months) never equals zero — a loan with a term of 0 months would have to be repaid immediately… Therefore, in this context, the intercept has little practical meaning and should not be overinterpreted.
Multiple Regression: Bankruptcy Example
Collinearity
Including multiple predictors helps reduce or eliminate omitted variable bias, but another challenge can arise: correlation among predictors.
When two or more predictors are correlated, we say they are collinear.
This collinearity makes it difficult to disentangle each variable’s individual contribution to the response, and can complicate model estimation and interpretation.
The regular \(R^2\) often overstates how much variability the model explains, especially for new samples.
To obtain a more reliable measure of model fit, we use the adjusted \(R^2\), which penalizes unnecessary predictors and better reflects true explanatory power:
\[R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}\]
where \(n\) is the number of observations and \(k\) the number of predictors.
n <- nrow(y1)
k <- length(coefficients(lm3)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2; summary(lm3)$adj.r.squared
[1] 0.2596924
[1] 0.2596924
Model Selection
The best model is not always the most complicated.
Including variables that are not truly important can actually reduce predictive accuracy.
In this section, we discuss model selection strategies that help eliminate variables contributing little to the model’s explanatory power.
Models that have undergone such variable pruning are often called parsimonious models — a term that simply means efficient and no more complex than necessary.
Model Selection
Our goal is to identify a smaller, more interpretable model that performs just as well as a “full” model, or even better.
Adjusted \(R^2\) measures the strength of a model’s fit while penalizing unnecessary predictors.
It helps evaluate which variables are truly adding value to the model—that is, improving its ability to predict future outcomes.
“All models are wrong, but some are useful” (George Box)
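One automated pruning strategy is stepwise selection, available via R's built-in step(); it scores models by AIC rather than adjusted \(R^2\), but serves the same parsimony goal. A sketch on the built-in mtcars data:

```r
full <- lm(mpg ~ ., data = mtcars)   # all available predictors
slim <- step(full, trace = 0)        # backward elimination by AIC

formula(slim)                        # the pruned, more parsimonious model
AIC(slim) <= AIC(full)               # elimination never worsens AIC
```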
Model Selection
Bankruptcy example
lm3 <- lm(interest_rate ~ income_ver + debt_to_income + credit_util +
          bankruptcy + term + issued + credit_checks, data = y)
lm4 <- lm(interest_rate ~ income_ver + debt_to_income + credit_util +
          bankruptcy + term + credit_checks, data = y)
summary(lm3)$adj.r.squared; summary(lm4)$adj.r.squared
Call:
lm(formula = price ~ cond_new + stock_photo + wheels, data = x1)
Residuals:
Min 1Q Median 3Q Max
-11.454 -2.959 -0.949 2.712 14.061
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.0483 0.9745 36.990 < 2e-16 ***
cond_new 5.1763 0.9961 5.196 7.21e-07 ***
stock_photo 1.1177 1.0192 1.097 0.275
wheels 7.2984 0.5448 13.397 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.884 on 137 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.7128
F-statistic: 116.8 on 3 and 137 DF, p-value: < 2.2e-16
Multiple regression
Distribution of Residuals
Multiple regression
Model Check
Introduction to Logistic Regression
In this section, we introduce logistic regression as a modeling tool for cases where the response variable is categorical with two levels, such as Yes/No or Success/Failure.
Logistic regression is a type of generalized linear model (GLM) that is well-suited for response variables where ordinary multiple regression performs poorly.
In these settings, the residuals often deviate sharply from the normal distribution, violating key regression assumptions.
Introduction to Logistic Regression
Generalized linear models (GLMs) can be viewed as a two-stage modeling approach:
Model the response variable using an appropriate probability distribution (e.g., binomial or Poisson).
Model the parameter of that distribution (such as a probability or rate) as a function of predictors through a link function.
Logistic regression is a GLM used when the outcome variable is binary.
The outcome \(Y_i\) takes the value 1 with probability \(p_i\) and 0 with probability \(1 - p_i\).
Because each observation has a unique context (e.g., education, experience, etc.), the probability \(p_i\) can vary across individuals.
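Putting the two GLM stages together for the binary case, the logistic model links \(p_i\) to the predictors through the logit function:
\[\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\]
Inverting the link recovers the probability, \(p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}\), where \(\eta_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\).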
Introduction to Logistic Regression
Model Evaluation
We will assess model quality using the Akaike Information Criterion (AIC) — a measure that balances model fit and model simplicity.
Conceptually, AIC plays a similar role here to adjusted \(R^2\) in multiple regression, helping us compare models and penalize unnecessary complexity.
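As a sketch of how AIC is used in practice, compare two logistic fits on R's built-in mtcars data (a hypothetical example, not the audit-study model):

```r
# two candidate logistic models for a binary outcome (am: 0/1 transmission)
m1 <- glm(am ~ mpg,      data = mtcars, family = binomial)
m2 <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

AIC(m1, m2)   # lower AIC indicates the better fit/complexity trade-off
```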
college_degree mean sd n min max
1 0 3.587116 18.60370 136600 2.600541 4.573690
2 1 5.936073 23.63323 350400 5.153550 6.718596
Audit Study
Barplot
Audit Study
Best model
lm3 <- glm(callback ~ ., data = x[,-3], family = binomial(link = "logit"))
summary(lm3)
Call:
glm(formula = callback ~ ., family = binomial(link = "logit"),
data = x[, -3])
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.71616 0.15510 -17.513 < 2e-16 ***
chicago -0.43642 0.11406 -3.826 0.00013 ***
years_experience 0.02055 0.01015 2.024 0.04297 *
military -0.34426 0.21571 -1.596 0.11050
honors 0.76341 0.18525 4.121 3.77e-05 ***
email_address 0.22208 0.11301 1.965 0.04940 *
white 0.44291 0.10802 4.100 4.13e-05 ***
male -0.19591 0.13520 -1.449 0.14733
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2726.9 on 4869 degrees of freedom
Residual deviance: 2659.5 on 4862 degrees of freedom
AIC: 2675.5
Number of Fisher Scoring iterations: 5
Audit Study
Interpret Coefficients
## estimate the probability of receiving a callback for:
###   a job in Chicago
###   14 years experience
###   no honors, no military experience,
###   email address and has a first name that implies they are a White male
z <- as.data.frame(coefficients(lm3))
z <- data.frame(var = row.names(z), coef = z[,1])
z$x <- c(1, 1, 14, 0, 0, 1, 1, 1)
coefs <- sum(z$coef * z$x)
prob_callback <- exp(coefs) / (1 + exp(coefs))
prob_callback
[1] 0.0834939
Audit Study
Interpret Coefficients
## 3.7% more likely than random
(prob_callback / mean(x$callback)) - 1